
Demonstrates autograd integration with NVFuser multidevice #3787

Merged: 6 commits merged into NVIDIA:main on Feb 12, 2025

Conversation

@syed-ahmed (Contributor) commented Jan 29, 2025

This PR demonstrates how to wrap a forward and a backward fusion definition in a torch.autograd.Function that takes PyTorch DTensors as input and outputs PyTorch DTensors.
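
Roughly, the pattern looks like this (a minimal sketch, not the exact code in this PR: plain matmuls stand in for the two FusionDefinitionWrapper calls, and DTensor is imported from the public torch.distributed.tensor namespace):

import torch
from torch.distributed.tensor import DTensor

class LinearFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input: DTensor, weight: DTensor, bias: DTensor) -> DTensor:
        ctx.save_for_backward(input, weight)
        # In the PR, a FusionDefinitionWrapper around the forward
        # FusionDefinition runs here; a plain linear stands in for it.
        return torch.matmul(input, weight.transpose(-2, -1)) + bias

    @staticmethod
    def backward(ctx, grad_output: DTensor):
        input, weight = ctx.saved_tensors
        # In the PR, a second FusionDefinitionWrapper implements the backward.
        grad_input = torch.matmul(grad_output, weight)
        # Flatten the leading (batch, sequence) dims for the weight gradient.
        go2d = grad_output.reshape(-1, grad_output.shape[-1])
        in2d = input.reshape(-1, input.shape[-1])
        grad_weight = torch.matmul(go2d.transpose(0, 1), in2d)
        grad_bias = go2d.sum(dim=0)
        return grad_input, grad_weight, grad_bias

# Usage, with DTensor inputs built via distribute_tensor:
# out_dtensor = LinearFunction.apply(inp_dtensor, weight_dtensor, bias_dtensor)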

@wujingyue (Collaborator)

Cool -- add me to reviewers when it's ready!

@syed-ahmed (Contributor, Author)

@wujingyue Ready for review.

@syed-ahmed (Contributor, Author)

Oops, I can't add you to the reviewers list.

@wujingyue self-requested a review on January 29, 2025 at 22:52
@wujingyue (Collaborator) left a comment


LGTM otherwise

tests/python/test_dtensor.py (outdated diff; thread resolved):
        self.s = sequence
        self.e = hidden

class LinearForwardDefinition(FusionDefintionArguments):
@wujingyue (Collaborator):

I feel using class and inheritance is overkill. Functions and partials should be good enough:

def define_linear_forward(config: LinearConfig, fd: FusionDefinition) -> None:

and later

partial(define_linear_forward, config)
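
Spelled out a bit more (a sketch of the suggestion; LinearConfig, its fields, and the example values are placeholders, not code from this PR):

from dataclasses import dataclass
from functools import partial

from nvfuser import FusionDefinition

@dataclass(frozen=True)
class LinearConfig:
    d: int  # device-mesh size
    b: int  # batch
    s: int  # sequence length
    e: int  # hidden size

def define_linear_forward(config: LinearConfig, fd: FusionDefinition) -> None:
    # Build the forward fusion from config here (body elided).
    ...

config = LinearConfig(d=2, b=1, s=128, e=768)
# The partial with config bound is a drop-in replacement for a
# LinearForwardDefinition instance.
forward_definition = partial(define_linear_forward, config)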

):
    b, s, e = input._local_tensor.shape
    d = weight.device_mesh.size()
    op = FusionDefinitionWrapper(LinearForwardDefinition(d, b, s, e))
@wujingyue (Collaborator):

Can you try to construct the op in __init__? Example: https://github.com/canqin001/PointDAN/blob/5001b38cb5506b1c6b40ad1329c1d6f4fbbdd26d/Model.py#L29. I'm worried about the overhead of constructing FusionDefinitionWrapper for each forward and backward call.

@syed-ahmed (Contributor, Author):

I don't think we can define __init__ for a torch.autograd.Function: https://github.com/pytorch/pytorch/blob/main/torch/autograd/function.py#L498-L506. Also, the code from that link looks quite old. I get the following error when I try to add __init__ and use .apply:

# When calling LinearFunction(d, b, s, e).apply(inp_dtensor, weight_dtensor, bias_dtensor)

[rank0]: Traceback (most recent call last):
[rank0]:   File "/opt/home/dtensor_extension/test.py", line 207, in <module>
[rank0]:     out_dtensor = LinearFunction(d, b, s, e).apply(inp_dtensor, weight_dtensor, bias_dtensor)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 575, in apply
[rank0]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/home/dtensor_extension/test.py", line 161, in forward
[rank0]:     outputs = self.forward_op([input, weight, bias])
[rank0]:               ^^^^
[rank0]: NameError: name 'self' is not defined
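
The NameError is expected: modern torch.autograd.Function subclasses define forward and backward as @staticmethods and are never instantiated, so there is no self inside them. Per-call state has to arrive through apply(), which accepts non-tensor arguments, or from enclosing scope. A minimal illustration (hypothetical example, not code from this PR):

import torch

class Scale(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):  # scale is a plain Python number, not a tensor
        ctx.scale = scale
        return x * scale

    @staticmethod
    def backward(ctx, grad_out):
        # One gradient per forward input; None for the non-tensor argument.
        return grad_out * ctx.scale, None

y = Scale.apply(torch.randn(3, requires_grad=True), 2.0)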

@wujingyue (Collaborator):

Thanks for confirming! We don't need to figure this out for this PR, but I still wonder how to avoid creating a FusionDefinitionWrapper for each call.
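
One possibility (just a sketch, not part of this PR) would be to memoize construction on the shape parameters, so repeated forward/backward calls with the same (d, b, s, e) reuse a single wrapper:

from functools import lru_cache

# LinearForwardDefinition and FusionDefinitionWrapper are the names used in
# this PR's test; the caching itself is hypothetical.
@lru_cache(maxsize=None)
def get_forward_op(d: int, b: int, s: int, e: int):
    return FusionDefinitionWrapper(LinearForwardDefinition(d, b, s, e))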

tests/python/test_dtensor.py (four more review threads, all outdated and resolved)
@syed-ahmed (Contributor, Author)

!test

@syed-ahmed requested a review from wujingyue on February 11, 2025 at 01:14
@syed-ahmed (Contributor, Author)

!test

@syed-ahmed (Contributor, Author)

@wujingyue Addressed the review comments.

@wujingyue (Collaborator)

There are some real errors in nvfuser-ci/jit_python_distributed_tests_20_A100.

@syed-ahmed (Contributor, Author)

!test

@syed-ahmed (Contributor, Author)

@wujingyue This is ready to be merged.

@wujingyue merged commit b0d36aa into NVIDIA:main on Feb 12, 2025 (52 of 53 checks passed)